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Abstract: We implemented fast Gaussian gridding (FGG)-based non- 
uniform fast Fourier transform (NUFFT) on the graphics processing unit 
(GPU) architecture for ultrahigh-speed, real-time Fourier-domain optical 
coherence tomography (FD-OCT). The Vandermonde matrix-based non- 
uniform discrete Fourier transform (NUDFT) as well as the linear/cubic 
interpolation with fast Fourier transform (InFFT) methods are also 
implemented on GPU to compare their performance in terms of image 
quality and processing speed. The GPU accelerated InFFT/NUDFT/NUFFT 
methods are applied to process both the standard half-range FD-OCT and 
complex full-range FD-OCT (C-FD-OCT). GPU-NUFFT provides an 
accurate approximation to GPU-NUDFT in terms of image quality, but 
offers >10 times higher processing speed. Compared with the GPU-InFFT 
methods, GPU-NUFFT has improved sensitivity roll-off, higher local 
signal-to-noise ratio and immunity to side-lobe artifacts caused by the 
interpolation error. Using a high speed CMOS line- scan camera, we 
demonstrated the real-time processing and display of GPU-NUFFT-based 
C-FD-OCT at a camera-limited rate of 122 k line/s (1024 pixel/A- scan). 
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1. Introduction 

Fourier-domain optical coherence tomography (FD-OCT) is capable of providing depth/time- 
resolved images of biological tissues noninvasively with micron level resolution. These 
features make FD-OCT systems suitable for applications in microsurgical guidance and 
intervention [1-6]. For practical clinical applications, FD-OCT systems require raw data 
acquisition, data processing, and visualization rate in excess of 10's of kHz speed. 

The required raw data acquisition rate (A-scan line rate) of FD-OCT systems has been 
achieved in recent years where > 100,000 line/s speeds are common [7-14]. Even > 1,000,000 
line/s rate has been achieved for multi-channel FD-OCT [15]. The speed of FD-OCT is 
determined either by the line rate of CMOS/CCD camera (for spectrometer-based systems) or 
by the spectral sweeping frequency (for swept laser-based systems). Ultrahigh acquisition 
speed is essential for time-resolved 4D recording of dynamic biological processes such as eye 
blinking, papillary reaction to light stimulus [10,11], and embryonic heart beating [12-14]. 

However, data processing and image rendering/display speed have not kept up with the 
ever increasing rate of data acquisition; this became one of the limiting factors for employing 
FD-OCT systems in practical microsurgery guidance and intervention applications. Currently, 
for most FD-OCT systems, the raw data is acquired in real-time and saved for post- 
processing. For microsurgeries, such imaging protocol provides valuable "pre-operative/post- 
operative" images, but is incapable of providing real-time, "inter- operative" imaging for 
surgical guidance and visualization. In addition, standard FD-OCT systems suffer from 
spatially reversed complex-conjugate ghost images that could severely misguide the users. As 
a solution, the complex full-range FD-OCT (C-FD-OCT) has been utilized, which removes 
the complex-conjugate image by applying a phase modulation on interferogram frames 
[16-20]. A 140k line/s 2048-pixel C-FD-OCT has been implemented for volumetric anterior 
chamber imaging [20]. However, the complex-conjugate processing is even more time- 
consuming and presents an extra burden when providing real-time images during surgical 
procedures. 

Several methods have been implemented to improve data processing and visualization of 
FD-OCT images: Field-programmable gate array (FPGA) has been applied to both 
spectrometer and swept source-based systems [21,22]; multi-core CPU parallel processing has 
been implemented and achieved 80,000 line/s processing rate on nonlinear-k polarization- 
sensitive OCT system and 207,000 line/s on linear-k systems, both with 1024-point/A-scan 
[23,24]. Moreover, recent progress in general-purpose computing on graphics processing units 
(GPGPU) makes it possible to implement heavy-duty OCT data processing and visualization 
on a variety of low-cost, many-core graphics cards. 

For standard half-range FD-OCT, GPU-based data processing has been implemented on 
both linear-k and non-linear-k systems [25-27]. Real-time 4D OCT imaging has also been 
achieved up to 10 volume/s through GPU-based volume rendering [24,26]. We have found 
that, based on a 1024-pixel standard FD-OCT system using NVIDIA's GTX 480 GPU, the 
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maximum processing line rate is >3, 000,000 line/s (effectively > 1,000,000 line/s under data 
transfer limit), which will be shown in detail in Section 4. 

For complex full-range FD-OCT, the processing workload is more than 3 times the 
standard OCT, since each A-scan requires three fast Fourier transforms (FFT) in different 
dimensions of the frame, a band-pass filtering, and necessary matrix transpose [16]. In a 
separate work, we realized a real-time processing of 1024-pixel C-FD-OCT at >500,000 line/s 
(effectively >300,000 line/s under data transfer limit), and a real-time camera-limited display 
speed of 244,000 line/s [28]. Very recently, a 27,900 line/s 2048-pixel C-FD-OCT system was 
reported by Watanabe et al during the preparation of this manuscript [29]. 

In most FD-OCT systems, the signal is sampled nonlinearly in k-space, which will 
seriously degrade the image quality if the FFT is directly applied to such signal. So far there 
have been both hardware and software solutions to the nonlinear-k issue. Hardware solutions 
such as linear-k spectrometer [30], linear-k swept laser [31] and k-triggering [32] have been 
successfully implemented, but these methods generally increase the system complexity and 
cost. Software solutions include various interpolation methods such as simple linear 
interpolation, oversampled linear interpolation, zero-filling linear interpolation, and cubic 
spline interpolation. Different GPU-based interpolation methods have also been implemented 
and compared [25-27]. Alternatively, the non-uniform discrete Fourier transform (NUDFT) 
has been proposed recently for both swept source OCT [33] and spectrometer-based OCT [34] 
through direct Vandermonde matrix multiplication with the spectrum vector. Compared with 
the interpolation-FFT (InFFT) method, NUDFT is simpler to implement and immune to the 
interpolation-caused errors such as increased background noise and side-lobes, especially at 
larger image depth [35]. Moreover, NUDFT has improved sensitivity roll-off than the InFFT 
[34]. However, NUDFT by direct matrix multiplication is extremely time-consuming, with a 
complexity of 0(N 2 ), where N is the raw data size of an A-scan. As an approximation to 
NUDFT, the gridding-based non-uniform fast Fourier transform (NUFFT) has been tried to 
process simulated [36] and experimentally acquired data [35] with reduced calculation 
complexity of -O(NlogN). To the best of our knowledge, NUDFT/ NUFFT have yet to be 
utilized in ultra-high speed, real-time FD-OCT systems due to computational complexity and 
associated latency in data processing. 

In this work, we implemented the fast Gaussian gridding (FGG) -based NUFFT on the 
GPU architecture for ultrafast signal processing in a general FD-OCT system. The 
Vandermonde matrix-based NUDFT as well as the linear/cubic InFFT methods are also 
implemented on GPU as comparisons of image quality and processing speed. GPU-NUFFT 
provides a very close approximation to GPU-NUDFT in terms of image quality while offering 
>10 times higher processing speed. Compared with the GPU-InFFT methods, we have also 
observed improved sensitivity roll-off, a higher local signal-to-noise ratio, and absence of 
side-lobe artifacts in GPU-NUFFT. Using a high speed CMOS line-scan camera, we 
demonstrated the real-time processing and display of GPU-NUFFT-based C-FD-OCT at a 
camera-limited speed of 122 k line/s (1024 pixel/A- scan). 

2. System configuration 

The FD-OCT system used in this work is spectrometer-based, as shown in Fig. 1. A 12-bit 
dual-line CMOS line-scan camera (Sprint spL2048-140k, Basler AG, Germany) works as the 
detector of the OCT spectrometer. A superluminescence diode (SLED) (X 0 = 825nm, AX = 
70nm, Superlum, Ireland) was used as the light source, which provided a theoretical axial 
resolution of approximately 5.5um in air, 4.1um in water. Since the SLED's spectrum only 
covers less than half of the CMOS camera's sensor array, the camera is set to work at 1024- 
pixel mode by selecting the area-of-interest (AOI). The camera works at the "dual-line 
averaging mode" to get 3dB higher SNR of the raw spectrum [7], and the minimum line 
period is camera- limited to 7.8jis, which corresponds to a maximum line rate of 128k A- 
scan/s. The beam scanning was implemented by a pair of high speed galvanometer mirrors 
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driven by a dual channel function generator and synchronized with a high speed frame 
grabber (PCIE-1429, National Instruments, USA). For C-FD-OCT mode, a phase modulation 
is applied to each B-scan's 2D interferogram frame by slightly displacing the probe beam off 
the first galvanometer's pivoting point (here only the first galvanometer is illustrated in the 
figure) [17,18]. The scanning angle is 8° and the beam offset is 1.4mm, therefore a carrier 
frequency of 30 kHz is obtained according to reference [18]. The transversal resolution was 
approximately 20jim, assuming Gaussian beam profile. A quad-core Dell T7500 workstation 
was used to host the frame grabber (PCIE x4 interface) and GPU (PCIE xl6), and a GPU 
(NVIDIA GeForce GTX 480) with 480 stream processors (each processor working at 
1.45GHz) and 1.5GBytes graphics memory was used to perform the FD-OCT signal 
processing. The GPU is programmed through NVIDIA' s Compute Unified Device 
Architecture (CUD A) technology [37]. The FFT operation is implemented by the CUFFT 
library [38]. The software is developed under Microsoft Visual C++ environment with the 
NI-IMAQ Win32 API (National Instrument). 




Fig. 1. System configuration: CMOS, CMOS line scan camera; L, spectrometer lens; G, 
grating; CI, C2, C3, achromatic collimators; C, 50:50 broadband fiber coupler; CL, camera 
link cable; COMP, host computer; GPU, graphics processing unit; PCIE, PCI Express xl6 
interface; MON, Monitor; GV, galvanometer (only the first galvanometer is illustrated for 
simplicity); SL, scanning lens; DCL, dispersion compensation lens; M, reference mirror; PC, 
polarization controller; SP, Sample. 

3. Implementation of GPU-NUDFT and GPU-NUFFT in FD-OCT systems 

In this section, the implementation of both GPU-NUDFT and GPU-NUFFT in a standard FD- 
OCT system is described. For the implementation, the wavenumber-pixel relation of the 
system k[i] = In I /l[i] is pre-calibrated accurately, where i refers to the pixel index. 

3.1 GPU-NUDFT in FD-OCT 

After pre-calibrating the k[i] relation, the depth information A[z m ] can be implemented 
through discrete Fourier transform over non-uniformly distributed datalfkj , as in [33-35], 



A[zJ = £l[kJexp 



-j^(V k o)* m 
Ak 



,m=0,l,2,...,N-l, 



(1) 
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where N is the total pixel number of a spectrum, z m refers to the depth coordinate with the 
pixel index m. Ak = k N1 -k 0 is the wavenumber range. 

For standard FD-OCT, where I[kJ are real-values, Eq. (1) can be reduce to half-range as, 



A[zJ = £l[kJexp 



2n 

-j — (k r k 0 )*m 
Ak 



, m=0,l,2,...,N/2-l. 



(2) 



For C-FD-OCT, where I[kJ are complex- values after Hilbert transform [16], Eq. (1) can 
be modified to full-range as, 



A[z m ] = £l[k i ]exp 



-j — (k.-kj* m 

J Ak 1 0 I 2 



, m=0,l,2,...,N-l. 



(3) 



Here the index m is shifted by N/2 to set the DC component to the center of A[z m ] . 
Considering a frame with M A-scans, Eq. (2) and (3) can be written in matrix form for 
processing the whole frame as, 



where 



Ayr = D half I real , (Standard half-range FD-OCT) , 



(4) 



A half = 



A 0 [z«] 
A 0 [z,] 



A,[z 0 ] 
AJzJ 



Ao [ Z N/2-2] A 1 [Z^ v/2 _ 2 ] 
A 0 [ Z ^/2-l] AJZ^^ J 

I 0 [z 0 ] Iitz 0 ] 
I 0 [zJ IJzJ 

I 0 [z N _ 2 ] Ii[z N _ 2 ] 

.I 0 [ Z N-l] ^N-J 



1 

Po 1 



1 



-(N/2-2) -(N/2-2) 

Po Pi 



-(N/2-1) 



-(N/2-1) 



^M-2t Z o] 
A M _ 2 [Zj ] 



Am.JZq] 
A M-l[ Z J 



A M-2 [ Z N/2-2 ] ^M-A Z N/2-2^ 
A M-2 t Z A>72-l ] A M-1 t Z A^/2-l ] 

^M-2t Z o] Im-i[ Z o] 

^M-2t Z N-2] ^M-lt Z N-2] 
^M-2t Z N-J Im-i[ Z N-J. 



1 



1 

-1 

Pn-i 



-(N/2-2) -(N/2-2) 

Pn-2 Pn-i 

(N/2-1) -(N/2-1) 

Pn-2 Pn-i 



(5) 



(6) 



(7) 



and 



A full = Dfunlcompiex > (Standard half-range FD-OCT) , 



(8) 



where 
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A 0 [z 0 ] 
A 0 [zJ 



A,[z 0 ] 
A^zJ 



A M . 2 [z 0 ] A M1 [z 0 ] 
^M-2t z J A M1 [zJ 



A M _ 2 [ z i] 



A 0 [z N _ 2 ] A 1 [z N _ 2 ] 
A^z^J A 1 [z n _ 1 ] 



complex 



D M i = 





Ii[z 0 ] 


w 


IitzJ 




^[Zn^] 




^[Zn-J 


" n +(N/2) 

Po 


n +(N/2) 
Pi 


+(N/2-l) 

Po 


+(N/2-l) 
Pi 


-(N/2-2) 

Fo 


-(N/2-2) 
Pi 


-(N/2-1) 

_Po 


-(N/2-1) 
Pi 



Am-2[Z n _ 2 ] ^M-l[Z N _ 2 ] 

\-2[ Z N-J A M1 [Z N1 ] 

^M-2t Z 0 ] I M1 [Z Q ] 

W Z J Im-iEzJ 

^M-2t Z N _ 2 ] I M1 [Z N 2 ] 

^M-2[Z N _J I M1 [Z N J 



(9) 



(10) 



n +(N/2) 
Pn-2 
+(N/2-l) 
Pn-2 



-(N/2-2) 

Pn-2 

-(N/2-1) 

Pn-2 



n +(N/2) 

Pn-i 

+(N/2-l) 

Pn-2 



-(N/2-2) 

Pn-i 

-(N/2-1) 

Pn-i 



(11) 



where the subscript of A[z m ] and 1^ ] denotes the index of A-scan within one frame, the 

complex factor p { = exp[j27i/ Ak*(k i -k 0 )] , D half and D M are the Vandermonde matrix, 

which can be pre-calculated from k[i] . 

To realize C-FD-OCT mode, a phase modulation ^(x) = Pxis applied to each B-scan's 2D 

interferogram frame I(k,x) by slightly displacing the probe beam off the galvanometer's 

pivoting point, as shown in Fig. 1. Here x indicates A-scan index in each B-scan and by 
applying Fourier transform along x direction, the following equation can be obtained [16]: 

j2< 



F_ u [I(k,x)] = |E r (k)| z 8(u) 

+ r u {F_JE s (k,x)]} 

+ F x ^[E;(k,x)E r (k)](x)5(u+P) 

+ F x ^ u [E s (k,x)E;(k)]®5(u-P), 



(12) 



where E s (k,x) and E r (k) are the electrical fields from the sample and reference arms, 

respectively. T u {} is the correlation operator. The first three terms on the right hand of 

Eq. (12) present the DC noise, autocorrelation noise, and complex-conjugate noise, 
respectively. The last term can be filtered out by a proper band-pass filter in the u domain and 
then convert back to x domain by applying an inverse Fourier transform along x direction. 
Here to implement the standard Hilbert transform, we use the Heaviside step function as the 
band-pass filter and the more delicate filters such as super Gaussian filter can also be designed 
to optimize the performance [29] . Finally, the OCT image is obtained by NUDFT in k domain 
and logarithmically scaled for display. 



#135184 - $15.00 USD Received 17 Sep 2010; revised 14 Oct 2010; accepted 15 Oct 2010; published 22 Oct 2010 
(C) 2010 OSA 25 October 2010 / Vol. 18, No. 22 / OPTICS EXPRESS 23478 



Thread 3 




Display 



Fig. 2. Processing flowchart for GPU-NUDFT based FD-OCT: CL, Camera Link; FG, frame 
grabber; HM, host memory; GM, graphics global memory; DC, DC removal; MT, matrix 
transpose; FFT-x, Fast Fourier transform in x direction; IFFT-x, inverse Fast Fourier transform 
in x direction; BPF-x, band pass filter in x direction; Log, logarithmical scaling. The solid 
arrows describe the main data stream and the hollow arrows indicate the internal data flow of 
the GPU. The blue dashed arrows indicate the direction of inter-thread triggering. The hollow 
dashed arrow denotes standard FD-OCT without the Hilbert transform in x direction. Blue 
blocks: memory for pre-stored data; Yellow blocks: memory for real-timely refreshed data. 



The GPU-CPU hybrid processing flowchart for standard/complex FD-OCT using GPU- 
NUDFT is shown in Fig. 2, where three major threads are used for the data acquisition 
(Thread 1), the GPU processing (Thread 2), and the image display (Thread 3), respectively. 
The three threads are triggered unidirectionally and work in the synchronized pipeline mode, 
as indicated by the blue dashed arrows in Fig. 2. Thread 2 is a GPU-CPU hybrid thread which 
consists of hundreds of thousands of GPU threads. The solid arrows describe the main data 
stream and the hollow arrows indicate the internal data flow of the GPU. The DC removal is 
implemented by subtracting a pre-stored frame of reference signal. The Vandermonde matrix 
D half / D m is pre-calculated and stored in graphics memory, as the blue block in Fig. 2. The 

NUDFT is implemented by an optimized matrix multiplication algorithm on CUDA using 
shared memory technology for the maximum usage of the GPU's floating point operation 
ability [39]. The graphics memory mentioned at the current stage of our system refers to 
global memory, which has relatively lower bandwidth and is another major limitation of GPU 
processing in addition to the PCIE xl6 bandwidth limit. The processing speed can be further 
increased by mapping to texture memory, which has higher bandwidth than global memory. 

3.2 GPU-NUFFT in FD-OCT 

The direct GPU-NUDFT presented in Section 3.1 has a computation complexity of 0(N 2 ), 
which greatly limits the computation speed and scalability for real-time display even on GPU, 
as experimentally shown in Section 4. Alternative to direct GPU-NUDFT, here we 
implemented fast Gaussian gridding-based GPU-NUFFT to approximate GPU-NUDFT: the 
raw signal I[k A ] is first oversampled by convolution with a Gaussian interpolation kernel on a 
uniform grid, as [40], 



IJu] = £l[i]g t [k t [u]-k[i]] , u=0,l,2,...,Mr-l, 



(13) 



g t [k] = exp 



-k 2 



(14) 



Ax 
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_L 71 

N 2 R(R-0.5) 



G T [n] = exp[n 2 x] , 



(15) 
(16) 



where k x [u] is uniform grid covering the same range as k[i] , g T [k] is the Gaussian 

interpolation kernel, M r = R* N is the uniform gridding size, and R is the oversampling ratio. 

On each k x [u] grid, the source data I[i] is selected such that k x [u] is within the nearest 

2*M sp grids tok[i] , where M sp is the kernel spread factor. The calculation of Eq. (13) is 

illustrated in Fig. 3, where we set R = 2, M sp = 2 and values of the blue dots on the same 

k x [u] grid are summed together to obtain I T [u] . The selection of proper R and M sp will be 

further discussed in Section 4.2. Subsequently I T [u]is processed by the regular uniform FFT 

and finally undergo a deconvolution of g x [k] by multiplying the factor G T [n] , as in Eq. (16), 

where n denotes the axial index. If R>1, there will be redundant data due to oversampling and 
the final data needs to be truncated to the size of N. The processing flowchart of GPU- 
NUFFT-based FD-OCT is shown in Fig. 4. 

Here it is worth noting that the Kaiser-Bessel function is found to be the optimal 
convolution kernel for the gridding-based NUFFT shown in recent works [35,36]. The 
implementation of Kaiser-Bessel convolution on GPU is similar to the Gaussian kernel and 
will be studied and evaluated in our future work. 




k T [0] k T [l] k,[2] k T [3] k,[4] k,[5] k,[6] k,[7] k T [8] k,[9] k T [10] kjll] 



Fig. 3. Convolution with a Gaussian interpolation kernel on a uniform grid when R = 2, 
M SD = 2. 
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Fig. 4. Processing flowchart for GPU-NUFFT based FD-OCT: CL, Camera Link; FG, frame 
grabber; HM, host memory; GM, graphics global memory; DC, DC removal; MT, matrix 
transpose; CON; convolution with Gaussian kernel; FFT-x, Fast Fourier transform in x 
direction; IFFT-x, inverse Fast Fourier transform in x direction; BPF-x, band pass filter in x 
direction; FFT-k r , FFT in k r direction; TRUC, truncation of redundant data in k r direction; 
DECON, deconvolution with Gaussian kernel; Log, logarithmical scaling. The blue dashed 
arrows indicate the direction of inter-thread triggering. The solid arrows describe the main data 
stream and the hollow arrows indicate the internal data flow of the GPU. The hollow dashed 
arrow denotes standard FD-OCT without the Hilbert transform in x direction. 



4. GPU processing test and comparison of different FD-OCT methods 

4.1 GPU processing line rate for different FD-OCT methods 

First we performed benchmark line rate test of different FD-OCT processing methods as 
follows: 

LIFFT: Standard FD-OCT with linear spline interpolation; 
LIFFT-C: C-FD-OCT with linear spline interpolation; 
CIFFT: Standard FD-OCT with cubic spline interpolation; 
CIFFT-C: C-FD-OCT with cubic spline interpolation; 
NUDFT: Standard FD-OCT with NUDFT; 
NUDFT -C: C-FD-OCT with NUDFT; 
NUFFT: Standard FD-OCT with NUFFT; 
NUFFT -C: C-FD-OCT with NUFFT; 



All algorithms are tested on the GTX 480 GPU with 4096 lines of both 1024-pixel 
spectrum and 204 8 -pixel spectrum. Here the 204 8 -pixel mode is tested as reference only and 
we will use the 1024-pixel mode for the real-time imaging tests in this work. For each case, 
both the peak internal processing line rate and the reduced line rate considering the data 
transfer bandwidth of PCIE xl6 interface are listed in Fig. 5. The processing time is measured 
using CUDA functions and all CUDA threads are synchronized before the measurement. In 
our system, the PCIE xl6 interface between the host computer and the GPU has a data 
transfer bandwidth of about 4.6 GByte/s for both transfer-in and transfer-out. The data size is 
8.4 MByte for 4096 lines of 1024-pixel spectrum and 16.8 MByte for 4096 lines of 2048-pixel 
spectrum, by allocating 2 Byte for each pixel. 
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Fig. 5. Benchmark line rate test of different FD-OCT processing method, (a) 1024-pixel FD- 
OCT; (b) 2048-pxiel FD-OCT; (c) 1024-pixel NUFFT-C with different frame size; (d) 2048- 
pixel NUFFT-C with different frame size; Both the peak internal processing line rate and the 
reduced line rate considering the data transfer bandwidth of PCIE xl6 interface are listed. 

As in Fig. 5(a), the final processing line rate for 1024-pixel C-FD-OCT with GPU-NUFFT 
is 173k line/s, which is still higher than the maximum camera acquisition rate of 128k line/s, 
while the GPU-NUDFT speed is relatively lower in both standard and complex FD-OCT. 
Also it is notable that the processing speed of LIFFT goes up to >3, 000,000 line/s (effectively 
> 1,000,000 line/s under data transfer limit), achieving the fastest processing line rate to date to 
the best of our knowledge. 

For C-FD-OCT, the Hilbert transform, which is implemented by two Fast Fourier 
transforms, has the computational complexity of ~0(M*logM), where M is the number of 
lines within one frame. Therefore the processing line rate of C-FD-OCT is also influenced by 
the frame size M. To verify this, here we tested the relation between processing line rate of 
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NUFFT-C mode versus frame size, as shown in Fig. 5(c) and 5(d), and a speed decrease is 
observed with increasing frame size. 

The zero-filling interpolation with FFT is also effective in suppressing the side-lobe effect 
and background noise for FD-OCT. However, the zero-filling usually require an oversampling 
factor of 4 or 8, and two additional FFT, which considerably increase the array size and 
processing time of the data [41]. From Fig. 5(b), for 2048-pixel NUFFT-C mode, a peak 
internal processing speed of about 100,000 line/s is achieved. This is >40% faster than 70,000 
line/s (1024 lines for 14.75ms) achieved in reference [29], which implemented GPU-based 
zero-filling interpolation for C-FD-OCT using a 480-core GPU (NVIDIA Geforce GTX 295, 
2 x 240 core, 1.3GHz for each core). 

4.2 Comparison of point spread function and sensitivity roll-off 

Then we compared the point spread function (PSF) and sensitivity roll-off of different FD- 
OCT processing methods, as shown in Fig. 6(a)-6(d), using GPU based LIFFT, CIFFT, 
NUDFT and NUFFT, respectively. A mirror is used as an image sample for evaluating the 
PSF. Here 1024 of GPU-processed 1024-pixel A-scans are averaged to present the mean noise 
level for the PSF in each axial position [35]. From Fig. 6(a), it is noticed that using LIFFT 
method introduces significant background level and side-lobes around the signal peak. The 
side-lobes tend to be broad and extend to their neighboring peaks, which results in a 
significant degrade of OCT image. When CIFFT is used instead, as shown in Fig. 6(b), the 
side-lobes are suppressed in the shallow image depth, but still considerably high in deeper 
depth. Figure 6(c) and 6(d) shows significant suppression of side-lobes over the whole image 
depth which utilized NUDFT and NUFFT, respectively. Figure 6(e) presents an individual 
PSF at a certain depth close to the imaging depth limit (intensity of each A-scan is 
normalized). Both LIFFT and CIFFT have large side-lobes, while both NUDFT and NUFFT 
have flat background and no noticeable side-lobes, which also indicate a higher local SNR at 
deeper axial positions. The measured sensitivity roll-off is summarized in Fig. 6(f). All 
methods have a very close sensitivity level at shallow image depth <400jim, while the 
interpolation based methods presents faster roll-off rate as axial position is increased. Here the 
sensitivity roll-off of LIFFT and CIFFT can be further improved by deconvolution of the 
linear/cubic interpolation kernel, and the result would be more close to NUDFT and NUFFT 
[35,36]. It is also notable from Fig. 6(e) and 6(f) that the NUFFT method exhibit negligible 
difference from the more time-consuming NUDFT, indicating a very close and reliable 
approximation. The full width at half maximum (FWHM) values of individual A-scans 
processed by different methods are presented in Fig. 6(g), which indicates very close axial 
resolution for all methods. 

Figure 6(h) presents the processing result of a certain A-scan using NUFFT with R = 2 and 
different M sp from 1 to 3. Here R is set to 2 for the convenience of applying GPU-based FFT 

functions (for length of 2 N ). From Fig. 6(h), it can be noticed that forM sp > 2 , the NUFFT 

result is close enough to NUDFT result, therefore M sp = 2 is selected to decrease the 

computational load. 
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Fig. 6. Point spread function and sensitivity roll-off of different processing methods: (a) 
LIFFT; (b) CIFFT; (c) NUDFT; (d) NUFFT; (e) Comparison of PSF at certain image depth 
using different processing; (f) Comparison of sensitivity roll-off using different processing 
methods; (g) A-scan FWHM with depth; (h) Performance of NUFFT with different M sp values. 

4.3 Comparison of real-time image quality 

We compared the real-time image quality of GPU- accelerated C-FD-OCT using different 
processing methods. Here a multi-layered polymer phantom is used as a sample. In the 
scanning protocol, each frame consists of 4296 A-scans in acquisition, but the first 200 lines 
are disposed before processing, since they are within the fly-back period of the galvanometer. 
Therefore each frame-size is 4096 pixel (lateral) x 1024 pixel (axial). 
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Fig. 7. Real-time image of multilayered phantom using different processing methods, where the 
bars represent 1mm in both dimensions for all images: (a) LIFFT (Media 1, 29.8 fps); (b) 
CIFFT (Media 2, 29.8 fps); (c) NUDFT (Media 3, 9.3 fps); (d) NUFFT (Media 4, 29.8 fps). All 
images are originally 4096 pixel (lateral) x 1024 pixel (axial) and rescaled to 1024 pixel 
(lateral) x 512 pixel (axial) for display on the monitor. (e)~(h): Magnified view corresponding 
to the blue-boxed area in (a)~(d). ZDL: zero delay line. The red arrows in (a) and (b) indicate 
the ghost image due to the presence of side-lobes of the reflective surface at a large image 
depth relative to ZDL. The red lines correspond to the A-scans extracted from the same lateral 
position of each image, shown collectively in (i). The side-lobes of LIFFT/CIFFT are indicated 
by the blue arrow in (i). 

First, a single frame is captured and GPU-processed using different methods, shown in 
Fig. 7. The red arrows in Fig. 7(a) and 7(b) indicate the ghost image due to the side-lobes of 
the reflective surface at a deeper image depth relative to the zero delay line (ZDL). The red 
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lines correspond to the same A-scan position extracted from each image for comparison and 
shown collectively in (i). The resulting LIFFT/CIFFT images exhibit side-lobes in the order of 
10/5 dB high compared to NUDFT/NUFFT images, as indicated by the blue arrow in 
Fig. 7(i). 

Then we screen-captured the real-time displayed scenarios using different GPU- 
accelerated C-FD-OCT, shown in Media 1, Media 2, Media 3, and Media 4. The image 
frames are rescaled to 1024 pixel (lateral) x 512 pixel (axial) to accommodate the monitor 
display. LIFFT/CIFFT/NUFFT modes are running at 29.8 fps, corresponding to a camera- 
limited line rate of 122k line/s, while the NUDFT mode is GPU-limited to 9.3 fps (38k line/s). 
As shown in both Fig. 7 and the corresponding movies, in LIFFT/CIFFT modes, the ghost 
image is evident when the phantom surface moves further away from the ZDL. However, 
there is no evidence of ghost images in NUDFT/NUFFT modes wherever the surface is 
placed. 

From these result, it is clear that the GPU-NUFFT method is a very close approximation 
of GPU-NUDFT while offering much higher processing speed. GPU-NUFFT can be achieved 
at a comparable processing speed to GPU-CIFFT and is immune to interpolation error-caused 
ghost images. 

5. In vivo human finger imaging using GPU-NUFFT based C-FD-OCT 

Finally we conducted the in vivo human finger imaging using GPU-NUFFT-based C-FD- 
OCT, displayed at 29.8fps with original frame size of 4096 pixel (lateral) x 1024 pixel (axial). 
Figure 8(a) and 8(b) present the coronal scans of the finger tip and palm, where the epithelial 
structures such as sweat duct (SD), stratum corneum (SC) and stratum spinosum (SS) are 
clearly distinguishable. Figure 8(c) and 8(d) present the coronal scans of the finger nail fold 
region, showing the major dermatologic structures such as epidermis (E), dermis (D), nail bed 
(NB), and nail root (NR), as well as in the sagittal scans in Fig. 8(e) and 8(f). The real-time 
display for each figure is captured as Media 5, Media 6, Media 7, Media 8, and Media 9, at 
1024 pixel (lateral) x 512 pixel (axial). Compared to standard FD-OCT, the GPU-NUFFT 
C-FD-OCT image is free of conjugate artifact, DC noise, and autocorrelation noise. These 
noises are problematical to remove in standard FD-OCT. Moreover, due to the 
implementation of the complex OCT processing, the image depth is effectively doubled, with 
the highest SNR region in the zero delay point. Such ultra high-speed real-time C-FD-OCT 
could be highly useful for microsurgical guidance and intervention applications. 
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Fig. 8. Real-time C-FD-OCT images using GPU-NUFFT, where the bars represent 1mm in 
both dimensions for all images: (a) (Media 5) Finger tip, (coronal), (b) (Media 6) Finger palm 
(coronal). (c)~(d) (Media 7) Finger nail fold (coronal); (e)~(f) (Media 8, Media 9) Finger nail 
(sagittal). SD, sweat duct; SC, stratum corneum; SS, stratum spinosum; NP, nail plate; NB, nail 
bed; NR, nail root; E, epidermis; D, dermis. 

6. Conclusion 

In this work, we implemented and successfully demonstrated the FGG-based NUFFT on the 
GPU architecture for online signal processing in a general FD-OCT system. The 
Vandermonde matrix-based NUDFT as well as the linear/cubic InFFT methods were also 
implemented on GPU as comparisons of image quality and processing speed. GPU-NUFFT 
provides an accurate approximation to GPU-NUDFT in terms of image quality while offering 
>10 times higher processing speed. Compared to the GPU-InFFT methods, GPU-NUFFT has 
better sensitivity roll-off, a higher local signal-to-noise ratio, and is immune to the side-lobe 
artifacts caused by interpolation error. Using a high speed CMOS line-scan camera, we 
demonstrated the real-time processing and display of GPU-NUFFT-based C-FD-OCT at a 
camera-limited speed of 122 k line/s (1024 pixel/A- scan). The GPU processing speed can be 
increased even higher by implementing a multiple-GPU architecture using more than one 
GPU in parallel. 
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